Lung segmentation in chest X-rays (CXRs) is an important prerequisite for improving the specificity of diagnoses of cardiopulmonary diseases in a clinical decision support system. Current deep learning (DL) models for lung segmentation are trained and evaluated on CXR datasets in which the radiographic projections are captured predominantly from the adult population. However, the shape of the lungs is reported to differ significantly in pediatric patients across the developmental stages from infancy to adulthood. This might result in age-related data domain shifts that would adversely impact lung segmentation performance when models trained on the adult population are deployed for pediatric lung segmentation. In this work, our goal is to analyze the generalizability of deep adult lung segmentation models to the pediatric population and improve performance through a systematic combinatorial approach consisting of CXR modality-specific weight initializations, stacked generalization, and an ensemble of the stacked generalization models. Novel evaluation metrics consisting of Mean Lung Contour Distance and Average Hash Score are proposed in addition to the Multi-scale Structural Similarity Index Measure, Intersection over Union, and Dice metrics to evaluate segmentation performance. We observed a significant improvement (p < 0.05) in cross-domain generalization through our combinatorial approach. This study could serve as a paradigm to analyze the cross-domain generalizability of deep segmentation models for other medical imaging modalities and applications.
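For readers unfamiliar with the two overlap metrics named above, the following is a minimal sketch of Dice and Intersection over Union on binary masks. The function name `dice_and_iou`, the nested-list mask format, and the convention that two empty masks score 1.0 are assumptions made for illustration; they are not taken from the paper.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two equally sized binary masks given as nested lists."""
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in truth for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(a and b for a, b in zip(p, t))
    union = sum(a or b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    # Assumed convention: two empty masks agree perfectly.
    dice = 2.0 * intersection / total if total else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou


# Toy 2x3 masks: 2 overlapping pixels, 3 foreground pixels in each mask.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
dice, iou = dice_and_iou(pred, truth)
print(f"Dice={dice:.4f} IoU={iou:.4f}")
```

On the same pair of masks Dice is never lower than IoU, which is why the two are usually reported together rather than compared against each other.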
translated by 谷歌翻译
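Automated segmentation of tuberculosis (TB)-consistent lesions in chest X-rays (CXRs) using deep learning (DL) methods can help reduce radiologist effort, supplement clinical decision-making, and potentially improve patient treatment. Most works in the literature discuss training automatic segmentation models using coarse bounding box annotations. However, the granularity of bounding box annotations can result in the inclusion of a considerable fraction of false positives and false negatives at the pixel level, which may adversely impact overall semantic segmentation performance. This study (i) evaluates the benefits of using fine-grained annotations of TB-consistent lesions and (ii) trains and constructs variants of U-Net models for segmenting these lesions in CXRs. We evaluated segmentation performance using several ensemble methods such as bitwise-AND, bitwise-OR, bitwise-MAX, and stacking. We observed that the stacking ensemble demonstrated superior segmentation performance (Dice score: 0.5743, 95% confidence interval: (0.4055, 0.7431)) compared to the individual constituent models and the other ensemble methods. To the best of our knowledge, this is the first study to apply ensemble learning to improve fine-grained TB-consistent lesion segmentation performance.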
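The two simplest ensemble rules mentioned above can be sketched as follows. This is an illustration only: the helper names (`combine`, `bitwise_and`, `bitwise_or`), the nested-list mask format, and the 0.5 binarization threshold are assumptions, and the bitwise-MAX and stacking ensembles are left out because the abstract does not define them.

```python
def combine(maps, reducer):
    """Apply `reducer` to the list of per-model values at every pixel."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[reducer([m[r][c] for m in maps]) for c in range(cols)]
            for r in range(rows)]


def binarize(prob_map, threshold=0.5):
    """Turn a probability map into a 0/1 mask (threshold is an assumed value)."""
    return [[int(p >= threshold) for p in row] for row in prob_map]


def bitwise_and(masks):
    # A pixel is lesion only if every model marks it as lesion.
    return combine(masks, lambda values: int(all(values)))


def bitwise_or(masks):
    # A pixel is lesion if at least one model marks it as lesion.
    return combine(masks, lambda values: int(any(values)))


# Hypothetical probability maps from two segmentation models.
probs_a = [[0.9, 0.2], [0.4, 0.7]]
probs_b = [[0.8, 0.6], [0.1, 0.3]]
masks = [binarize(probs_a), binarize(probs_b)]

and_mask = bitwise_and(masks)
or_mask = bitwise_or(masks)
print(and_mask, or_mask)
```

The AND rule trades recall for precision and the OR rule does the opposite, which is one reason a learned combiner such as stacking can outperform both.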
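Histopathological analysis is the current gold standard for the diagnosis of precancerous lesions. Automated histopathological classification from digital images requires supervised training, which in turn needs a large number of expert annotations that can be expensive and time-consuming to collect. Meanwhile, accurate classification of image patches cropped from whole-slide images is essential for standard sliding-window-based histopathology slide classification methods. To mitigate these issues, we propose a carefully designed conditional GAN model, namely HistoGAN, for synthesizing realistic histopathology image patches conditioned on class labels. We also investigate a novel synthetic augmentation framework that selectively adds new synthetic image patches generated by our proposed HistoGAN, rather than directly expanding the training set with synthetic images. By selecting synthetic images based on the confidence of their assigned labels and their feature similarity to real labeled images, our framework provides quality assurance for synthetic augmentation. Our models are evaluated on two datasets: a cervical histopathology image dataset with limited annotations, and another dataset of lymph node histopathology images with metastatic cancer. Here, we show that leveraging HistoGAN-generated images with selective augmentation results in significant and consistent improvements in classification performance (6.7% and 2.8%, respectively) on the cervical histopathology and metastatic cancer datasets.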
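Chest X-ray (CXR) is a widely performed radiology examination that helps detect abnormalities in the tissues and organs of the thoracic cavity. Detecting pulmonary abnormalities such as COVID-19 can be difficult because they are obscured by bony structures like the ribs and clavicles, resulting in screening/diagnostic misinterpretations. Automated bone suppression methods help suppress these bony structures and increase soft-tissue visibility. In this study, we propose to build an ensemble of convolutional neural network models to suppress bones in frontal CXRs, improve classification performance, and reduce interpretation errors related to COVID-19 detection. The ensemble is constructed by (i) measuring the multi-scale structural similarity index (MS-SSIM) score between the sub-blocks of the bone-suppressed image predicted by each of the top-3 performing bone suppression models and the corresponding sub-blocks of its respective ground-truth soft-tissue image, and (ii) performing a majority voting of the MS-SSIM scores computed in each sub-block to identify the sub-block with the maximum MS-SSIM score and use it in constructing the final bone-suppressed image. We empirically determined the sub-block size that delivers superior bone suppression performance. We observed that the bone suppression model ensemble outperformed the individual models in terms of MS-SSIM and other metrics. A modality-specific classification model is retrained and evaluated on the non-bone-suppressed and bone-suppressed images to classify them as showing normal lungs or COVID-19-like manifestations. We observed that the model trained on bone-suppressed images significantly outperformed the model trained on non-bone-suppressed images in detecting COVID-19 manifestations.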
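Medical images commonly exhibit multiple abnormalities. Predicting them requires multi-class classifiers whose training and desired reliable performance can be affected by a combination of factors, such as dataset size, data source, distribution, and the loss function used to train the deep neural networks. Currently, the cross-entropy loss remains the de facto loss function for training deep learning classifiers. This loss function, however, asserts equal learning from all classes, leading to a bias toward the majority class. In this work, we benchmark various state-of-the-art loss functions that are suitable for multi-class classification, critically analyze model performance, and propose improved loss functions. We select a pediatric chest X-ray (CXR) dataset that includes images with no abnormality (normal) and images exhibiting manifestations consistent with bacterial and viral pneumonia. We construct prediction-level and model-level ensembles, respectively, to improve classification performance. Our results show that, compared to the individual models and the state-of-the-art literature, the weighted averaging of the predictions of the top-3 and top-5 model-level ensembles delivered significantly superior classification performance (p < 0.05) in terms of the MCC metric (0.9068, 95% confidence interval (0.8839, 0.9297)). Finally, we performed localization studies to interpret model behavior, visualizing and confirming that the individual models and ensembles learned meaningful features and highlighted disease manifestations.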
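As a rough illustration of the weighted-averaging ensemble and the MCC metric mentioned above, the sketch below averages class probabilities from two hypothetical models and scores the result with the standard multi-class Matthews correlation coefficient. The probabilities, the equal weights, and all function names are invented for the example; the abstract does not state how the weights were chosen.

```python
from math import sqrt


def weighted_average(prob_sets, weights):
    """Average per-sample class probabilities across models with the given weights."""
    total = sum(weights)
    n_samples, n_classes = len(prob_sets[0]), len(prob_sets[0][0])
    return [[sum(w * probs[i][k] for w, probs in zip(weights, prob_sets)) / total
             for k in range(n_classes)]
            for i in range(n_samples)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def multiclass_mcc(y_true, y_pred, n_classes):
    """Multi-class MCC computed from label counts; returns 0.0 when undefined."""
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k = [y_true.count(k) for k in range(n_classes)]
    p_k = [y_pred.count(k) for k in range(n_classes)]
    numerator = c * s - sum(t * p for t, p in zip(t_k, p_k))
    denominator = sqrt((s * s - sum(p * p for p in p_k)) *
                       (s * s - sum(t * t for t in t_k)))
    return numerator / denominator if denominator else 0.0


# Hypothetical labels: 0 = normal, 1 = bacterial, 2 = viral pneumonia.
y_true = [0, 1, 2, 1]
model_a = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.3, 0.4, 0.3], [0.1, 0.8, 0.1]]
model_b = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.1, 0.2, 0.7], [0.2, 0.6, 0.2]]

pred_a = [argmax(p) for p in model_a]
ensemble = weighted_average([model_a, model_b], weights=[0.5, 0.5])
pred_ens = [argmax(p) for p in ensemble]

mcc_a = multiclass_mcc(y_true, pred_a, 3)
mcc_ens = multiclass_mcc(y_true, pred_ens, 3)
print(f"model A MCC={mcc_a:.4f}, ensemble MCC={mcc_ens:.4f}")
```

In this toy case each model misclassifies a different sample, so averaging their probabilities corrects both errors; real ensembles gain less when the constituent models make correlated mistakes.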
Recent advances in open-domain question answering (ODQA) have demonstrated impressive accuracy on standard Wikipedia-style benchmarks. However, it is less clear how robust these models are and how well they perform when applied to real-world applications in drastically different domains. While there has been some work investigating how well ODQA models perform when tested for out-of-domain (OOD) generalization, these studies have been conducted only under conservative shifts in data distribution and typically focus on a single component (i.e., retrieval) rather than an end-to-end system. In response, we propose a more realistic and challenging domain shift evaluation setting and, through extensive experiments, study end-to-end model performance. We find that not only do models fail to generalize, but high retrieval scores often still yield poor answer prediction accuracy. We then categorize different types of shifts and propose techniques that, when presented with a new dataset, predict whether intervention methods are likely to be successful. Finally, using insights from this analysis, we propose and evaluate several intervention methods that improve end-to-end answer F1 score by up to 24 points.
Answering complex questions that require making latent decisions is a challenging task, especially when limited supervision is available. Recent works leverage the capabilities of large language models (LMs) to perform complex question answering in a few-shot setting by demonstrating how to output intermediate rationalizations while solving the complex question in a single pass. We introduce "Successive Prompting", where we iteratively break down a complex task into a simple task, solve it, and then repeat the process until we get the final solution. Successive prompting decouples the supervision for decomposing complex questions from the supervision for answering simple questions, allowing us to (1) have multiple opportunities to query in-context examples at each reasoning step, (2) learn question decomposition separately from question answering, including using synthetic data, and (3) use bespoke (fine-tuned) components for reasoning steps where a large LM does not perform well. The intermediate supervision is typically manually written, which can be expensive to collect. We introduce a way to generate a synthetic dataset that can be used to bootstrap a model's ability to decompose and answer intermediate questions. Our best model (with successive prompting) achieves an improvement of ~5% absolute F1 on a few-shot version of the DROP dataset when compared with a state-of-the-art model with the same supervision.
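The decompose-then-answer loop described above can be sketched in a few lines. Everything here is hypothetical: `successive_prompting`, `toy_decompose`, and `toy_answer` are illustrative names, and the toy functions are hard-coded stand-ins for what would be language model calls in the actual method.

```python
def successive_prompting(question, decompose, answer, max_steps=10):
    """Alternate between asking for the next simple question and answering it."""
    history = []  # (sub_question, sub_answer) pairs produced so far
    for _ in range(max_steps):
        sub_question = decompose(question, history)
        if sub_question is None:  # the decomposer signals that no step remains
            break
        history.append((sub_question, answer(sub_question, history)))
    final = history[-1][1] if history else None
    return final, history


# Toy stand-ins for the two components (a real system would prompt an LM here).
FIELD_GOALS = [22, 35, 48]  # made-up yardages for illustration


def toy_decompose(question, history):
    if len(history) == 0:
        return "longest field goal"
    if len(history) == 1:
        return "shortest field goal"
    if len(history) == 2:
        return f"difference {history[0][1]} {history[1][1]}"
    return None


def toy_answer(sub_question, history):
    if sub_question == "longest field goal":
        return max(FIELD_GOALS)
    if sub_question == "shortest field goal":
        return min(FIELD_GOALS)
    _, a, b = sub_question.split()
    return int(a) - int(b)


final, steps = successive_prompting(
    "How many yards longer was the longest field goal than the shortest?",
    toy_decompose,
    toy_answer,
)
print(final, steps)
```

Because the decomposer and the answerer are separate callables, either one can be swapped for a fine-tuned or symbolic component without touching the loop, which mirrors the decoupling the abstract emphasizes.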
Test log-likelihood is commonly used to compare different models of the same data and different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on other distributional quantities like means; and (ii) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations.
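A toy calculation, not taken from the paper, shows how point (i) can arise. Model A predicts the test mean exactly but is badly overconfident, while model B has a biased mean and a reasonable spread; the data values and both models are invented for this illustration.

```python
from math import log, pi


def gaussian_log_likelihood(data, mean, std):
    """Sum of log N(x | mean, std^2) over the data points."""
    var = std * std
    return sum(-0.5 * log(2 * pi * var) - (x - mean) ** 2 / (2 * var)
               for x in data)


test_data = [-2.0, -1.0, 0.0, 1.0, 2.0]
data_mean = sum(test_data) / len(test_data)

# Model A: correct mean, far too small a standard deviation.
mean_a, std_a = 0.0, 0.1
# Model B: biased mean, plausible standard deviation.
mean_b, std_b = 1.0, 1.5

ll_a = gaussian_log_likelihood(test_data, mean_a, std_a)
ll_b = gaussian_log_likelihood(test_data, mean_b, std_b)

print(f"model A: mean error={abs(mean_a - data_mean):.2f}, test log-lik={ll_a:.2f}")
print(f"model B: mean error={abs(mean_b - data_mean):.2f}, test log-lik={ll_b:.2f}")
```

Model B wins on test log-likelihood even though model A's mean forecast is exact, so ranking by log-likelihood and ranking by mean accuracy disagree here.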
Default implementations of Bayesian Additive Regression Trees (BART) represent categorical predictors using several binary indicators, one for each level of each categorical predictor. Regression trees built with these indicators partition the levels using a "remove one at a time" strategy. Unfortunately, the vast majority of partitions of the levels cannot be built with this strategy, severely limiting BART's ability to "borrow strength" across groups of levels. We overcome this limitation with a new class of regression tree and a new decision rule prior that can assign multiple levels to both the left and right child of a decision node. Motivated by spatial applications with areal data, we introduce a further decision rule prior that partitions the areas into spatially contiguous regions by deleting edges from random spanning trees of a suitably defined network. We implemented our new regression tree priors in the flexBART package, which, compared to existing implementations, often yields improved out-of-sample predictive performance without much additional computational burden. We demonstrate the efficacy of flexBART using examples from baseball and the spatiotemporal modeling of crime.
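To get a feel for how restrictive indicator-based splits are, the sketch below counts, for a single decision node, all ways of sending K levels to two non-empty children and compares that with the splits that isolate exactly one level. Restricting attention to one node is a simplification made for this illustration, and the function names are hypothetical; this is not code from flexBART.

```python
from itertools import combinations


def two_way_splits(levels):
    """All unordered splits of `levels` into two non-empty groups."""
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    splits = []
    # Pin the first level to the left group so each unordered split appears once;
    # stopping before len(rest) keeps the right group non-empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(levels) - left
            splits.append((left, right))
    return splits


def one_vs_rest_splits(levels):
    """Splits reachable with one binary indicator: a single level against the rest."""
    return [s for s in two_way_splits(levels)
            if len(s[0]) == 1 or len(s[1]) == 1]


levels = range(10)
all_splits = two_way_splits(levels)
indicator_splits = one_vs_rest_splits(levels)
print(len(all_splits), len(indicator_splits))
```

With K levels there are 2^(K-1) - 1 two-way splits but only K that isolate a single level, so the gap grows exponentially in K.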
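Recent models can generate fluent and grammatical synthetic reviews while accurately predicting user ratings. The generated reviews, which express users' estimated opinions of the related products, are often viewed as natural language "rationales" for the jointly predicted ratings. However, previous studies found that existing models often generate repetitive, universally applicable, and generic explanations, resulting in uninformative rationales. Further, our analysis shows that content generated by previous models often contains factual hallucinations. These issues call for novel solutions that can generate explanations that are both informative and factually grounded. Inspired by the recent success of using retrieved content in addition to parametric knowledge for generation, we propose to augment the generator with a personalized retriever, where the retriever's output serves as external knowledge for enhancing the generator. Experiments on the Yelp, TripAdvisor, and Amazon Movie Reviews datasets show that our model can generate explanations that more reliably entail existing reviews, are more diverse, and are rated as more informative by human evaluators.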